(54) INFORMATION RETRIEVAL UNIT

(57) Abstract

PROBLEM TO BE SOLVED: To obtain an information retrieval unit that calculates a similarity
degree reflecting the relations between keywords, thereby improving precision in
classification or retrieval.
SOLUTION: The unit is provided with a document database 10 storing multiple kinds of
document data, a vector generating means 20 for generating a feature vector of the keywords
for each kind of document data, a classifying means 30 for calculating the similarity degree
between the feature vectors and classifying the document data, and an output means 40 for
outputting the classification result of the document data. The vector generating means 20
analyzes each kind of document data, extracts the keywords and the relations between the
keywords, and generates the feature vector based on the appearance frequencies of both.




* NOTICES *

JPO and INPIT are not responsible for any damages caused by the use of this translation.

1. This document has been translated by computer. So the translation may not reflect the
original precisely.
2. **** shows the word which can not be translated.
3. In the drawings, any words are not translated.



CLAIMS 

[Claim(s)] 

[Claim 1]An information retrieval device comprising:
a document data base which stores two or more document data;
a vector generating means which generates a feature vector for each of the above-mentioned
document data;
a sorting means which calculates similarity between the above-mentioned feature vectors and
classifies each of the above-mentioned document data; and
an output means which outputs a classification result of the above-mentioned document data,
wherein the above-mentioned vector generating means analyzes each of the above-mentioned
document data, extracts keywords and relations between keywords, and generates the
above-mentioned feature vector based on the frequencies of occurrence of both.

[Claim 2]An information retrieval device comprising:
a document data base which stores two or more document data;
a search formula input means which inputs a search formula;
a vector generating means which generates a feature vector for each of the above-mentioned
document data and for the above-mentioned search formula;
a similarity calculation means which calculates similarity between the feature vector for the
above-mentioned search formula and the feature vector for each of the above-mentioned
document data; and
an output means which outputs document data whose feature vector has high similarity,
wherein the above-mentioned vector generating means analyzes each of the above-mentioned
document data and the search formula, extracts keywords and relations between keywords, and
generates the above-mentioned feature vector based on these frequencies of occurrence.

[Claim 3]The information retrieval device according to claim 1 or 2, wherein the
above-mentioned vector generating means uses a dependency relation as the relation between
the above-mentioned keywords.

[Claim 4]The information retrieval device according to claim 1 or 2, wherein the
above-mentioned vector generating means uses closeness of the distance between keywords as
the relation between the above-mentioned keywords.

[Claim 5]The information retrieval device according to any one of claims 1 to 4, wherein, in
place of the individual frequencies of occurrence of keywords contained in a keyword group
belonging to the same category, or of relations between keywords containing such keywords,
the above-mentioned vector generating means uses the sum of those frequencies of occurrence
as the frequency of occurrence of a keyword representing the category, or of a relation
between keywords containing that keyword.
[Claim 6]The information retrieval device according to any one of claims 1 to 5, wherein the
above-mentioned vector generating means generates a feature vector based on weights which a
user specifies for the frequencies of occurrence of relations between keywords.
DETAILED DESCRIPTION



[Detailed Description of the Invention] 
[0001] 

[Field of the Invention]This invention relates to the classification and search of electronic
document data, and in particular to an information retrieval device which classifies and
searches document data automatically.
[0002] 

[Description of the Prior Art]For the classification and search of electronic document data,
information retrieval devices such as the one shown in JP,11-110395,A have been proposed. In
the information retrieval device proposed there, the frequencies of occurrence of synonyms
are summed, a feature vector is generated, the similarity between feature vectors is
calculated, and each document data is classified. Attaching a weight to each of two or more
words which have a synonymous relation is also proposed in JP,11-110395,A.
[0003]JP,10-198691,A discloses summing the frequencies of occurrence of synonyms to generate
a feature vector, and also discloses, for example, registering adjoining word pairs and
synonym pairs found in a document data base and using them in the calculation of a feature
vector.
[0004] 

[Problem(s) to be Solved by the Invention]The conventional information retrieval devices
configured as above do not perform classification or search that reflects relations between
keywords, such as the dependency between words and phrases, and therefore cannot perform
classification or search with high precision.

[0005]This invention was made in order to solve the above problem. Its object is to obtain an
information retrieval device which makes possible a similarity calculation reflecting not
only keywords but also the relations between keywords, and which can thereby improve the
accuracy of classification or search.
[0006] 

[Means for Solving the Problem]An information retrieval device concerning this invention has
a document data base which stores two or more document data, a vector generating means which
generates a feature vector for each document data, a sorting means which calculates
similarity between feature vectors and classifies each document data, and an output means
which outputs a classification result of the document data. In this device, the vector
generating means analyzes each document data, extracts keywords and relations between
keywords, and generates a feature vector based on the frequencies of occurrence of both.
[0007]An information retrieval device concerning this invention has a document data base
which stores two or more document data, a search formula input means which inputs a search
formula, a vector generating means which generates a feature vector for each document data
and for the search formula, a similarity calculation means which calculates the similarity
between the feature vector for the search formula and the feature vector for each document
data, and an output means which outputs document data whose feature vector has high
similarity. In this device, the vector generating means analyzes each document data and the
search formula, extracts keywords and relations between keywords, and generates a feature
vector based on these frequencies of occurrence.
number of paragraphs, etc. can be considered, for example. In this case as well, direction
may or may not be taken into account. As the TFij or DFj for the relation between keywords in
this case, the frequency of appearance for which this distance is smaller than a
user-designated value is used, for example.

[0022]It is also possible to perform classification reflecting the category of keywords, such
as synonym relations. Taking dependency as an example, suppose that the keywords a0 and a1
belong to the category A, and that a0 and a1 each have a dependency relation with b. The
dimensions for the keywords a0 and a1, namely "a0", "a0->b", "a1", and "a1->b", can then be
merged into "A" and "A->b" when the feature vector is generated.
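
As an illustration only, the merging described in [0022] might be sketched in Python as follows. The function name merge_categories, the "x->y" label format, and the category_of mapping are assumptions made for this sketch and do not appear in the original.

```python
from collections import Counter

def merge_categories(counts, category_of):
    """Merge frequency counts of keywords and relations by category.

    counts:      {label: frequency}, where a label is a keyword ("a0")
                 or a dependency relation written as "a0->b" (assumed format).
    category_of: {keyword: category name}; keywords not listed are kept as-is.
    """
    merged = Counter()
    for label, freq in counts.items():
        # Replace every keyword in the label by its category, if it has one.
        parts = label.split("->")
        merged_label = "->".join(category_of.get(p, p) for p in parts)
        # Frequencies of labels that collapse to the same name are added.
        merged[merged_label] += freq
    return merged

counts = {"a0": 2, "a0->b": 1, "a1": 3, "a1->b": 4, "b": 5}
merged = merge_categories(counts, {"a0": "A", "a1": "A"})
```

Here "a0" and "a1" collapse into "A", and "a0->b" and "a1->b" collapse into "A->b", with their frequencies summed.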

[0023]A dimension contributes to the similarity only when the corresponding components of
both feature vectors being compared in classification are non-zero (that is, when they
coincide). In general, however, a relation between keywords is less likely to coincide than a
mere keyword, so the contribution of relations between keywords to the similarity tends to be
lower than that of keywords. The contributions of the two to the similarity evaluation can
therefore be balanced by making the weight of the dimensions for relations between keywords
larger than the weight of the dimensions for keywords.
[0024]It is also possible for a user to select a keyword and to enlarge the weight of the
dimension for that keyword and of the dimensions for the relations between keywords which
contain that keyword, thereby performing a classification which emphasizes the keyword to
which the user pays attention.
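
The weighting described in [0023] and [0024] might be sketched as follows. This is a minimal sketch under stated assumptions: the function name apply_weights, the parameter names, the default weight values, and the "x->y" label format are all hypothetical, and the original does not specify how the weights are chosen.

```python
def apply_weights(vector, relation_weight=2.0, focus_keywords=(), focus_weight=1.0):
    """Scale feature-vector components before similarity is computed.

    vector:          {label: value}; a label containing "->" is treated as a
                     relation between keywords (assumed format).
    relation_weight: factor applied to relation dimensions, as in [0023].
    focus_keywords:  keywords the user wants to emphasize, as in [0024].
    focus_weight:    extra factor for dimensions that contain a focus keyword.
    """
    weighted = {}
    for label, value in vector.items():
        parts = label.split("->")
        # Relation dimensions get a larger weight than plain keyword dimensions.
        w = relation_weight if len(parts) > 1 else 1.0
        # Dimensions involving a user-selected keyword are emphasized further.
        if any(p in focus_keywords for p in parts):
            w *= focus_weight
        weighted[label] = value * w
    return weighted

weighted = apply_weights({"a": 1.0, "a->b": 1.0, "c": 1.0},
                         relation_weight=2.0, focus_keywords={"a"}, focus_weight=3.0)
```

A focus_weight below 1.0 would de-emphasize the selected keyword instead.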

[0025]Embodiment 2. Drawing 2 is a block diagram showing an example of the composition of an
information retrieval device for search, which is another example of this invention. The
document data base 10 and the output means 40 have the same functions as in Drawing 1.
[0026]The search formula input means 50 has a function to input a search condition as a
search formula (text expressing a search formula is also acceptable). The vector generating
means 20 generates, for all documents stored in the document data base 10, a feature vector
from the frequency of occurrence of the relations between keywords, for example according to
the feature vector formula described in Embodiment 1. It also has a function to generate a
feature vector from the inputted search formula, using the same method as the feature vector
generation method described in Embodiment 1. However, when a feature vector is generated from
a search formula, the index i of Vij denotes the search formula rather than the document i.
In this case, TFij is usually set to 1.

[0027]The search means 60 calculates the similarity between the feature vector generated from
the search formula and the feature vectors of all documents stored in the document data base
10, in the same way as described for the sorting means of Embodiment 1, and ranks the
documents in the document data base by similarity using the result.
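
The ranking step of [0027] might look like the following minimal sketch. It assumes feature vectors are stored as dictionaries from feature label to weight and uses cosine similarity as in Embodiment 1; the names rank_documents and _cosine and the sample document identifiers are hypothetical.

```python
import math

def _cosine(u, v):
    # Cosine of the angle between two sparse vectors given as {label: weight}.
    dot = sum(w * v[j] for j, w in u.items() if j in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, similarity) pairs, most similar document first."""
    scores = [(doc_id, _cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)

query = {"a": 1.0, "a->b": 1.0}
docs = {
    "d1": {"a": 1.0},               # shares only the keyword
    "d2": {"a": 1.0, "a->b": 1.0},  # shares the keyword and the relation
    "d3": {"c": 1.0},               # shares nothing
}
ranked = rank_documents(query, docs)
```

The document that matches both the keyword and the relation ranks above the one that matches the keyword alone, which is the effect the relation dimensions are meant to produce.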

[0028]The ranked result can be output by the output means 40.

[0029]As described in Embodiment 1, the vector generating means 20 can use dependency
relations, closeness of the distance between keywords, and so on as examples of the relations
between keywords extracted when generating a feature vector for each document data.
[0030]It is also possible to perform search reflecting the category of keywords, such as
synonym relations. The way relations between keywords are merged by category is the same as
illustrated in Embodiment 1.

[0031]As illustrated in Embodiment 1, by enlarging the weight of the dimensions for specific
relations between keywords, it is also possible to balance the weights of the dimensions, or
to perform search which emphasizes the relations between keywords to which a user pays
attention.
[0032] 

[Effect of the Invention]An information retrieval device concerning this invention has a
document data base which stores two or more document data, a vector generating means which
generates a feature vector for each document data, a sorting means which calculates the
similarity between feature vectors and classifies each document data, and an output means
which outputs the classification result of the document data. In this device, the vector
generating means analyzes each document data, extracts keywords and relations between
keywords, and generates a feature vector based on the frequencies of occurrence of both.
Therefore, in the classification of document data, a similarity calculation reflecting not
only the keywords of each document data but also the relations between keywords becomes
possible, and accuracy improves.

[0033]An information retrieval device concerning this invention has a document data base
which stores two or more document data, a search formula input means which inputs a search
formula, a vector generating means which generates a feature vector for each document data
and for the search formula, a similarity calculation means which calculates the similarity
between the feature vector for the search formula and the feature vector for each document
data, and an output means which outputs the document data whose feature vector has high
similarity. In this device, the vector generating means analyzes each document data and the
search formula, extracts keywords and relations between keywords, and generates a feature
vector based on these frequencies of occurrence. Therefore, in the search for document data
close to a search formula, a similarity calculation reflecting not only the keywords which
appear in the search formula and each document data but also the relations between keywords
becomes possible, and the accuracy of search improves.

[0034]The vector generating means uses a dependency relation as the relation between
keywords. Therefore, in the classification of document data or the search for document data
close to a search formula, a similarity calculation reflecting dependency relations is
performed, and the accuracy of classification or search improves compared with the
conventional method using only keywords.

[0035]The vector generating means uses closeness of the distance between keywords as the
relation between keywords. Therefore, in the classification of document data or the search
for document data close to a search formula, a similarity calculation reflecting the distance
between keywords is performed, and the accuracy of classification or search improves compared
with the conventional method using only keywords.

[0036]In place of the individual frequencies of occurrence of keywords contained in a keyword
group belonging to the same category, or of relations between keywords containing such
keywords, the vector generating means uses the sum of those frequencies of occurrence as the
frequency of occurrence of a keyword representing the category, or of a relation between
keywords containing that keyword. Therefore, in the classification of document data or the
search for document data close to a search formula, a similarity calculation which merges
relations between keywords that the user does not need to distinguish can be performed. As a
result, highly precise classification and search can be made more efficient.

[0037]The vector generating means generates a feature vector based on weights which a user
specifies for the frequencies of occurrence of relations between keywords. Therefore, in the
classification of document data or the search for document data close to a search formula, a
similarity calculation which reflects the user's intention by emphasizing or de-emphasizing
specific relations between keywords can be performed. As a result, classification and search
become more precise in a form that better reflects the user's intention.
TECHNICAL FIELD

[Field of the Invention]This invention relates to the classification and search of electronic
document data, and in particular to an information retrieval device which classifies and
searches document data automatically.
PRIOR ART

[Description of the Prior Art]For the classification and search of electronic document data,
information retrieval devices such as the one shown in JP,11-110395,A have been proposed. In
the information retrieval device proposed there, the frequencies of occurrence of synonyms
are summed, a feature vector is generated, the similarity between feature vectors is
calculated, and each document data is classified. Attaching a weight to each of two or more
words which have a synonymous relation is also proposed in JP,11-110395,A.
[0003]JP,10-198691,A discloses summing the frequencies of occurrence of synonyms to generate
a feature vector, and also discloses, for example, registering adjoining word pairs and
synonym pairs found in a document data base and using them in the calculation of a feature
vector.
EFFECT OF THE INVENTION



[Effect of the Invention]An information retrieval device concerning this invention has a
document data base which stores two or more document data, a vector generating means which
generates a feature vector for each document data, a sorting means which calculates the
similarity between feature vectors and classifies each document data, and an output means
which outputs the classification result of the document data. In this device, the vector
generating means analyzes each document data, extracts keywords and relations between
keywords, and generates a feature vector based on the frequencies of occurrence of both.
Therefore, in the classification of document data, a similarity calculation reflecting not
only the keywords of each document data but also the relations between keywords becomes
possible, and accuracy improves.

[0033]An information retrieval device concerning this invention has a document data base
which stores two or more document data, a search formula input means which inputs a search
formula, a vector generating means which generates a feature vector for each document data
and for the search formula, a similarity calculation means which calculates the similarity
between the feature vector for the search formula and the feature vector for each document
data, and an output means which outputs the document data whose feature vector has high
similarity. In this device, the vector generating means analyzes each document data and the
search formula, extracts keywords and relations between keywords, and generates a feature
vector based on these frequencies of occurrence. Therefore, in the search for document data
close to a search formula, a similarity calculation reflecting not only the keywords which
appear in the search formula and each document data but also the relations between keywords
becomes possible, and the accuracy of search improves.

[0034]The vector generating means uses a dependency relation as the relation between
keywords. Therefore, in the classification of document data or the search for document data
close to a search formula, a similarity calculation reflecting dependency relations is
performed, and the accuracy of classification or search improves compared with the
conventional method using only keywords.

[0035]The vector generating means uses closeness of the distance between keywords as the
relation between keywords. Therefore, in the classification of document data or the search
for document data close to a search formula, a similarity calculation reflecting the distance
between keywords is performed, and the accuracy of classification or search improves compared
with the conventional method using only keywords.

[0036]In place of the individual frequencies of occurrence of keywords contained in a keyword
group belonging to the same category, or of relations between keywords containing such
keywords, the vector generating means uses the sum of those frequencies of occurrence as the
frequency of occurrence of a keyword representing the category, or of a relation between
keywords containing that keyword. Therefore, in the classification of document data or the
search for document data close to a search formula, a similarity calculation which merges
relations between keywords that the user does not need to distinguish can be performed. As a
result, highly precise classification and search can be made more efficient.

[0037]The vector generating means generates a feature vector based on weights which a user
specifies for the frequencies of occurrence of relations between keywords. Therefore, in the
classification of document data or the search for document data close to a search formula, a
similarity calculation which reflects the user's intention by emphasizing or de-emphasizing
specific relations between keywords can be performed. As a result, classification and search
become more precise in a form that better reflects the user's intention.
TECHNICAL PROBLEM

[Problem(s) to be Solved by the Invention]The conventional information retrieval devices
configured as above do not perform classification or search that reflects relations between
keywords, such as the dependency between words and phrases, and therefore cannot perform
classification or search with high precision.

[0005]This invention was made in order to solve the above problem. Its object is to obtain an
information retrieval device which makes possible a similarity calculation reflecting not
only keywords but also the relations between keywords, and which can thereby improve the
accuracy of classification or search.
MEANS

[Means for Solving the Problem]An information retrieval device concerning this invention has
a document data base which stores two or more document data, a vector generating means which
generates a feature vector for each document data, a sorting means which calculates
similarity between feature vectors and classifies each document data, and an output means
which outputs a classification result of the document data. In this device, the vector
generating means analyzes each document data, extracts keywords and relations between
keywords, and generates a feature vector based on the frequencies of occurrence of both.
[0007]An information retrieval device concerning this invention has a document data base
which stores two or more document data, a search formula input means which inputs a search
formula, and a vector generating means which generates a feature vector for each document
data and
[0010]Instead of the frequency of occurrence of a relation between keywords in which a vector
component Vij of each dimension j of the feature vector Vi can be computed by the following
formula, for example.
[0015]Vij = TFij * log(N/DFj)
[0016]Here, TFij is the number of times the keyword or relation between keywords
corresponding to component j appears in document i, and DFj is the number of times the
keyword or relation between keywords corresponding to component j appears in all N documents
in the document data base 10. A feature vector is generated in this way.
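
As an illustration only, the formula of [0015] can be sketched in Python as below. Two points are assumptions rather than statements of the original: each document is taken to be already reduced to a list of feature labels (keywords and relation labels), and DFj is counted as the number of documents containing feature j, which is the usual document-frequency reading and keeps log(N/DFj) non-negative. The translated text could also be read as a total occurrence count. The function name feature_vectors is hypothetical.

```python
import math
from collections import Counter

def feature_vectors(docs):
    """Compute Vij = TFij * log(N / DFj) for every document.

    docs: list of documents, each a list of feature labels, e.g. the keyword
          "a" or a relation between keywords written as "a->b" (assumed format).
    Returns one {label: weight} dict per document.
    """
    n = len(docs)
    # TFij: number of times feature j appears in document i.
    tf = [Counter(doc) for doc in docs]
    # DFj: number of documents in which feature j appears (assumption, see above).
    df = Counter()
    for counts in tf:
        df.update(counts.keys())
    return [{j: counts[j] * math.log(n / df[j]) for j in counts} for counts in tf]

docs = [["a", "a->b", "a"], ["a", "c"], ["c"]]
vectors = feature_vectors(docs)
```

A feature that appears in every document receives a weight of zero under this formula, since log(N/N) = 0.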
[0017]In the sorting means 30, the similarity between documents is calculated and the
documents are clustered using the result. The similarity between documents can be calculated,
for example, as the cosine of the angle between the feature vectors of the documents computed
as described above. By using the similarity calculated from the feature vectors generated as
described above in the similarity calculation required by clustering algorithms such as the
K-means method, clustering which reflects not only the keywords used in conventional
clustering but also the relations between keywords is attained.
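
The cosine similarity mentioned in [0017] might be computed as in the following minimal sketch, which assumes feature vectors are stored sparsely as dictionaries from feature label to weight. The function name cosine_similarity is hypothetical, and the clustering step itself is not shown.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse feature vectors {label: weight}."""
    # Only dimensions that are non-zero in both vectors contribute to the dot product.
    dot = sum(w * v[j] for j, w in u.items() if j in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

sim_partial = cosine_similarity({"a": 1.0, "a->b": 1.0}, {"a": 1.0})
sim_none = cosine_similarity({"a": 1.0}, {"c": 1.0})
```

Two documents that share a keyword but not the relation around it score lower than two documents that share both, which is how the relation dimensions affect clustering.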
[0018]The classified result can be outputted by the output means 40. 

[0019]Here, as examples of the relations between keywords extracted by the vector generating
means 20 when generating a feature vector for each document data, dependency relations
obtained as a result of syntax analysis and closeness of the distance between keywords can be
considered.

[0020]First, regarding dependency, consider for example a sentence of the form "A does C to
B". In this sentence, the dependency relations "A does C" and "does C to B" exist. These may
be distinguished including their grammatical case. Alternatively, the case may be disregarded
to give "A->C" and "B->C", or the direction may also be disregarded to give A&C and B&C (that
is, "A->C" and "C->A" are regarded as the same). The frequency of appearance of such
dependency relations is used as the TFij or DFj for the relations between keywords in this
case.
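
The directed and undirected labelings of [0020] might be sketched as follows. The sketch assumes that a syntax analyzer has already produced (dependent, head) keyword pairs; the analyzer itself is outside the sketch, and the function name dependency_labels is hypothetical.

```python
def dependency_labels(pairs, directed=True):
    """Turn (dependent, head) keyword pairs into feature labels.

    directed=True  keeps the direction:      ("A", "C") -> "A->C"
    directed=False disregards the direction: ("A", "C") and ("C", "A") -> "A&C"
    """
    labels = []
    for dependent, head in pairs:
        if directed:
            labels.append(f"{dependent}->{head}")
        else:
            # Sorting the pair makes "A->C" and "C->A" map to the same label.
            first, second = sorted((dependent, head))
            labels.append(f"{first}&{second}")
    return labels

directed_labels = dependency_labels([("A", "C"), ("B", "C")])
undirected_labels = dependency_labels([("A", "C"), ("C", "A")], directed=False)
```

Counting these labels per document gives the TFij values for the relation dimensions.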

[0021]On the other hand, when closeness of the distance between keywords is used as the
relation between keywords, the distance can be measured, for example, by the number of
characters, the number of morphemes, the number of clauses, the number of sentences, or the
number of paragraphs between the keywords. In this case as well, direction may or may not be
taken into account. As the TFij or DFj for the relation between keywords in this case, the
frequency of appearance for which this distance is smaller than a user-designated value is
used, for example.
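
A proximity-based relation count in the spirit of [0021] might be sketched as follows. As assumptions for this sketch, the document is already split into tokens, the distance is the difference in token position (standing in for the number of morphemes between keywords), and direction is disregarded. The function name proximity_relations and its parameters are hypothetical.

```python
from collections import Counter

def proximity_relations(tokens, keywords, max_distance):
    """Count keyword pairs whose token distance is smaller than max_distance.

    tokens:       the document as a list of tokens.
    keywords:     set of tokens treated as keywords.
    max_distance: user-designated threshold on the position difference.
    Returns a Counter of undirected labels such as "keyword&vector".
    """
    positions = [(i, t) for i, t in enumerate(tokens) if t in keywords]
    counts = Counter()
    for idx, (i, a) in enumerate(positions):
        for j, b in positions[idx + 1:]:
            # Count the pair only when the two keywords differ and are close enough.
            if a != b and j - i < max_distance:
                first, second = sorted((a, b))
                counts[f"{first}&{second}"] += 1
    return counts

tokens = ["vector", "of", "keyword", "x", "y", "z", "vector"]
near = proximity_relations(tokens, {"vector", "keyword"}, max_distance=3)
```

Only the first "vector" is within the threshold of "keyword", so the pair is counted once.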

[0022]It is also possible to perform classification reflecting the category of keywords, such
as synonym relations. Taking dependency as an example, suppose that the keywords a0 and a1
belong to the category A, and that a0 and a1 each have a dependency relation with b. The
dimensions for the keywords a0 and a1, namely "a0", "a0->b", "a1", and "a1->b", can then be
merged into "A" and "A->b" when the feature vector is generated.

[0023]A dimension contributes to the similarity only when the corresponding components of
both feature vectors being compared in classification are non-zero (that is, when they
coincide). In general, however, a relation between keywords is less likely to coincide than a
mere keyword, so the contribution of relations between keywords to the similarity tends to be
lower than that of keywords. The contributions of the two to the similarity evaluation can
therefore be balanced by making the weight of the dimensions for relations between keywords
larger than the weight of the dimensions for keywords.
[0024]It is also possible for a user to select a keyword and to enlarge the weight of the
dimension for that keyword and of the dimensions for the relations between keywords which
contain that keyword, thereby performing a classification which emphasizes the keyword to
which the user pays attention.

[0025]Embodiment 2. Drawing 2 is a block diagram showing an example of the composition of an
information retrieval device for search, which is another example of this invention. The
document data base 10 and the output means 40 have the same functions as in Drawing 1.

[0026]The search formula input means 50 has a function to input a search condition as a
search formula (text expressing a search formula is also acceptable). The vector generating
means 20 generates, for all documents stored in the document data base 10, a feature vector
from the frequency of occurrence of the relations between keywords, for example according to
the feature vector formula described in Embodiment 1. It also has a function to generate a
feature vector from the inputted search formula, using the same method as the feature vector
generation method described in Embodiment 1. However, when a feature vector is generated from
a search formula, the index i of Vij denotes the search formula rather than the document i.
In this case, TFij is usually set to 1.

[0027]The search means 60 calculates the similarity between the feature vector generated from
the search formula and the feature vectors of all documents stored in the document data base
10, in the same way as described for the sorting means of Embodiment 1, and ranks the
documents in the document data base by similarity using the result.

[0028]The ranked result can be output by the output means 40.

[0029]As described in Embodiment 1, the vector generating means 20 can use dependency
relations, closeness of the distance between keywords, and so on as examples of the relations
between keywords extracted when generating a feature vector for each document data.
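[0030]It is also possible to perform search reflecting the category of keywords, such as
synonym relations. The way relations between keywords are merged by category is the same as
illustrated in Embodiment 1.

[0031]By enlarging the weight of the dimensions for specific relations between keywords,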
DESCRIPTION OF DRAWINGS

[Brief Description of the Drawings]

[Drawing 1]It is a block diagram showing the information retrieval device of this invention
for classification.

[Drawing 2]It is a block diagram showing the information retrieval device of this invention
for search.
[Description of Notations]

10 document data base, 20 vector generating means, 30 sorting means, 40 output means, 50
search formula input means, 60 similarity calculation means (search means).
[Drawing 2]




[Translation done.] 



Publication number: JP 2002-245067 A (P2002-245067A)
(43) Date of publication: 2002.8.30
(51) Int. Cl.: G06F 17/30
(21) Application number: 2001-37163 (P2001-37163)
(71) Applicant: 000006013
NOBQffiji&tflO AflcW LT, IdKOSfll 1 X'ft^tcC 
fcfcnglc. *-7-F©ff&f*tt©H«**-7-K 

*. 

[0 0 3 0] SSfC, roaSHfttt£Q*-7-F©* 

- pet < tt*~7- Kiaioi»fi5o*T=ry-'\os t 

[003 1] Sft, ft£©*-7-K£L<(i*-7- 

S8 1 TOflSwi:W«fc, *-7- K/+-7- Fnon 

V-jVttBt*+-7- H»b< tt*-7- FIVOBSft 
*iaLfcttJR*ff^CktBJ||6T*5. 
[003 2] 

irnmm zommzmmmu. «»© 

+-7-FIBOBBff«tttUL. tilJBJSSf c 

*©#aic*j^T\ *3»7«-*o*-7-F»feft 

<, *-7-FMOB8«*tEI»LfcSI«Sttl|[jitpjfi| 

[0033] sfc. c tumsRaBti. a ! 



8M2002-245067 
6 

fc. «aKK»t*Wf«^* F;Vfc&4©£8T-*lC 

tD<ommm^x^m^9 !»*«£«*-«. *© 

**tf*X«f-*KaaT5*-7-Fe»Ta:<, 
*-7- FN©nfl%6SttLfcaflMft*«n9afcft 

[0 0 3 4] ^»h^t«?Stt» +-7-FM 
T. ft&3»*©Nffi*fi3ftlfcatiKfta>WTtoti. * 

-7- K©**ffl^tett*o» a Ktt^a^ M58oa 

[0 0 3 5] *fc. «*hMU&mL +-7-FBB 
OttWcfctvc, *-7-KIHOBI|*K»LftBHttt 

ttJWffiatu *-7-Fowt«t*ft8aEo»acttt 
*^**a©»«jww:**. 

[0 0 3 6] Sfc, ^MI^RfRtt, IHM'J 
fcatS*-7- F»fc3Sti**-7- Ffc L < tt* 

n*s&*-7- FBcDiMoiianttottb o t, * 
©*r=ry *«« , r*+-7- Ft, t < »*n««tt* 

-7- FIS!©M&©ffigl$£fc LT*nsottJS«B* 

-7- FRI©Hffi«S Afc»HSft»*:ff* C t ff-p 
§5. S«S©^a.^©^$fk^|2l5c 
tOTJIttft*. 

[0 0 3 7] +-7-F 

atf+-7-FH©BB«©ma«atfc:s*u 

- F*5^tt*-7- Fn©B8ffi«a«««^tt1iaL 
fca<H*lt»*ff5<:i:3VT*5. fomu fJfflg© 

«^ct mbmxvm • aaoamafb«< 
caa©na«ais] 

[Ian e:©R^©»atMat5fsa^sB*^ 
[02] coj»!oMRiuna-r«affiaaiii«^ 



10 ast-^^-x. 20 "^h/wtam 3* 



j) 8082 0 0 2-2 4 50 6 7 

8 

*o »am 4om/jm so a^su^m 

6 0 8«Ufi3tV¥Bl Ct^S) o 



[023 



(7Z)5SE£ 'J>» S- (72)^^# tea »- 

SCR» ! F«fflK*,Ort-TS2#3^ H *S»WfflE*l0|'jrrB2#3f 

F*-A(£3) 5B075 NW33 NK02 HR12 PP23 PR04 
PR06 QHOS 



PATENT ABSTRACTS OF JAPAN 



(11)Publication number : 2002-245067
(43)Date of publication of application : 30.08.2002



(51)Int.Cl. G06F 17/30



(21)Application number : 2001-037163 (71)Applicant : MITSUBISHI ELECTRIC CORP
(22)Date of filing : 14.02.2001 (72)Inventor : KONAKA HIROYOSHI 

TSUDAKA SHINICHIRO 
KOBUNE RYUICHI 
ARITA HIDEKAZU



(54) INFORMATION RETRIEVAL UNIT 

(57)Abstract

PROBLEM TO BE SOLVED: To obtain an information
retrieval unit that calculates a degree of similarity
reflecting the relations between keywords, thereby
improving the precision of classification or retrieval.
SOLUTION: The unit is provided with a document
database 10 storing a plurality of document data, a
vector generating means 20 for generating a feature
vector for each document data, a classifying means 30
for calculating the degree of similarity between the
feature vectors and classifying the document data, and
an output means 40 for outputting the classification
result of the document data. The vector generating
means 20 analyzes each document data, extracts the
keywords and the relations between the keywords, and
generates the feature vector based on the appearance
frequencies of both.




* NOTICES * 



JPO and INPIT are not responsible for any 
damages caused by the use of this translation. 

1.This document has been translated by computer. So the translation may not reflect the original
precisely.

2.**** shows the word which can not be translated. 
3.In the drawings, any words are not translated.



CLAIMS 



[Claim(s)] 

[Claim 1]An information retrieval device comprising:
a document database which stores a plurality of document data;
a vector generating means which generates a feature vector for each of the document data;
a classifying means which calculates the similarity between the feature vectors and classifies
the document data; and
an output means which outputs a classification result of the document data,
wherein the vector generating means analyzes each of the document data, extracts keywords and
relations between keywords, and generates the feature vector based on the frequencies of
occurrence of both.

[Claim 2]An information retrieval device comprising:
a document database which stores a plurality of document data;
a search formula input means which inputs a search formula;
a vector generating means which generates a feature vector for each of the document data and
for the search formula;
a similarity calculation means which calculates the similarity between the feature vector of the
search formula and the feature vector of each of the document data; and
an output means which outputs the document data whose feature vectors have high similarity,
wherein the vector generating means analyzes each of the document data and the search formula,
extracts keywords and relations between keywords, and generates the feature vectors based on
their frequencies of occurrence.

[Claim 3]The information retrieval device according to claim 1 or 2, wherein the vector
generating means uses a dependency relation as the relation between keywords.

[Claim 4]The information retrieval device according to claim 1 or 2, wherein the vector
generating means uses, as the relation between keywords, the fact that the distance between the
keywords is short.

[Claim 5]The information retrieval device according to any one of claims 1 to 4, wherein the
vector generating means uses, in place of the individual frequencies of occurrence of keywords
contained in a keyword group belonging to the same category, or of relations between keywords
containing such keywords, the sum of those frequencies as the frequency of occurrence of a
keyword representing the category, or of a relation between keywords containing that
representative keyword.
[Claim 6]The information retrieval device according to any one of claims 1 to 5, wherein the
vector generating means generates the feature vector based on weights which a user specifies for
the frequencies of occurrence of relations between keywords.
DETAILED DESCRIPTION



[Detailed Description of the Invention] 
[0001] 

[Field of the Invention]This invention relates to the classification and search of electronic
document data, and in particular to an information retrieval device which classifies and
searches document data automatically.
[0002] 

[Description of the Prior Art]For the classification and search of electronic document data,
information retrieval devices such as the one shown in JP,11-110395,A have conventionally been
proposed. In the device proposed there, the frequencies of occurrence of synonyms are summed to
generate a feature vector, the similarity between feature vectors is calculated, and each
document data is classified. JP,11-110395,A also proposes assigning an individual weight to each
of two or more words which are in a synonymous relation.
[0003]Further, JP,10-198691,A discloses summing the frequencies of occurrence of synonyms to
generate a feature vector and, in addition, registering, for example, pairs of adjacent words in
the document database and pairs of synonyms, and using them in calculating the feature vector.
[0004] 

[Problem(s) to be Solved by the Invention]Because the conventional information retrieval
devices configured as above did not perform classification or search reflecting the relations
between keywords, such as the dependency between words and phrases, they could not carry out
high-precision classification or search.

[0005]This invention was made to solve the above problem, and its object is to obtain an
information retrieval device which enables similarity calculation reflecting not only keywords
but also the relations between keywords, and which can thereby improve the accuracy of
classification or search.
[0006] 

[Means for Solving the Problem]The information retrieval device according to this invention has
a document database which stores a plurality of document data, a vector generating means which
generates a feature vector for each document data, a classifying means which calculates the
similarity between feature vectors and classifies the document data, and an output means which
outputs a classification result of the document data. The vector generating means analyzes each
document data, extracts keywords and the relations between keywords, and generates the feature
vector based on the frequencies of occurrence of both.
[0007]The information retrieval device according to this invention also has a document database
which stores a plurality of document data, a search formula input means which inputs a search
formula, a vector generating means which generates a feature vector for each document data and
for the search formula, a similarity calculation means which calculates the similarity between
the feature vector of the search formula and the feature vector of each document data, and an
output means which outputs the document data whose feature vectors have high similarity. The
vector generating means analyzes each document data and the search formula, extracts keywords
and the relations between keywords, and generates the feature vectors based on their
frequencies of occurrence.

[0008]The vector generating means uses a dependency relation as the relation between keywords.

[0009]The vector generating means uses, as the relation between keywords, the fact that the
distance between the keywords is short.

[0010]In place of the individual frequencies of occurrence of keywords contained in a keyword
group belonging to the same category, or of relations between keywords containing such
keywords, the vector generating means uses the sum of those frequencies as the frequency of
occurrence of a keyword representing the category, or of a relation between keywords containing
that representative keyword.

[0011]The vector generating means generates the feature vector based on weights which a user
specifies for the frequencies of occurrence of relations between keywords.

[0012] 

[Embodiment of the Invention]Embodiment 1. Fig. 1 is a block diagram showing an example
configuration of the information retrieval device of this invention for classification. In the
figure, the document database 10 stores a plurality of document data. Each document data stored
in the document database 10 has at least text data.

[0013]The vector generating means 20 generates a feature vector for each document data. That
is, it applies morphological analysis and the like to the text data of each document data,
performs unnecessary-word (stop-word) processing and the like as needed, extracts keywords, and
extracts the relations between keywords.

[0014]Next, when K keywords and R relations between keywords are extracted from the whole of the
document database 10, which consists of N documents, the feature vector Vi of each document i
(1<=i<=N) is expressed, for example, as a vector of K+R dimensions. When the index of a keyword
or of a relation between keywords is denoted by j (1<=j<=K+R), the component Vij of each
dimension j of the feature vector Vi can be computed according to the tf-idf method, for
example, by the following formula.
[0015]Vij = TFij * log(N / DFj)

[0016]Here, TFij is the number of times the keyword or the relation between keywords
corresponding to component j appears in document i, and DFj is the number of times the keyword
or the relation between keywords corresponding to component j appears in all N documents of the
document database 10. The feature vector is generated in this way.
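
The formula of paragraph [0015] can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function name `tfidf_vectors`, the label format ("A" for a keyword, "A->C" for a relation) and the sample documents are hypothetical, and DFj is taken here as the number of documents containing feature j, as in the conventional tf-idf method.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (features, vectors) for documents given as lists of feature labels.

    Each label is either a keyword ("A") or a relation between keywords ("A->C").
    """
    n = len(docs)
    counts = [Counter(doc) for doc in docs]             # TFij for each document i
    features = sorted({f for c in counts for f in c})   # the K+R dimensions
    # Assumption: DFj is the number of documents in which feature j appears.
    df = {f: sum(1 for c in counts if f in c) for f in features}
    vectors = [[c[f] * math.log(n / df[f]) for f in features] for c in counts]
    return features, vectors

# Hypothetical input: keywords and relations already extracted per document.
docs = [
    ["A", "C", "A->C"],
    ["A", "B", "A->B"],
    ["A", "C", "A->C", "A->C"],
]
features, vectors = tfidf_vectors(docs)
```

A feature that appears in every document (here "A") receives the value 0 in every vector, because log(N / DFj) is 0 when DFj equals N.
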
[0017]In the classifying means 30, the similarity between documents is calculated, and the
documents are clustered using the result. The similarity between documents can be calculated,
for example, as the cosine of the angle between the feature vectors, computed as described
above, of the documents being compared. By using the similarity calculated from the feature
vectors generated as described above in the similarity calculation required by a clustering
algorithm such as the k-means method, clustering is achieved which reflects not only the
keywords used in conventional clustering but also the relations between keywords.
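
As an illustration of the cosine measure mentioned in paragraph [0017], the following minimal Python sketch computes the similarity between two feature vectors. The function name `cosine_similarity` and the convention of returning 0 for an all-zero vector are assumptions made for this example; the clustering step itself is not shown.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Pairwise similarities, e.g. as input to a clustering algorithm."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]
```

Because only dimensions that are non-zero in both vectors add to the dot product, two documents that share a relation such as "A->C" score higher than two documents that merely share the keywords "A" and "C".
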
[0018]The classification result can be output by the output means 40.

[0019]Here, as examples of the relations between keywords which the vector generating means 20
extracts when generating a feature vector for each document data, a dependency relation
obtained as a result of syntactic analysis and the closeness of the distance between keywords
can be considered.

[0020]First, regarding the dependency relation, consider for example the sentence "A performs C
on B". In this sentence, the dependency relations "A performs C" and "C is performed on B"
exist. These may be distinguished including their grammatical case; alternatively, the case may
be disregarded to give "A->C" and "B->C", or the direction may also be disregarded to give
"A&C" and "B&C" (that is, "A->C" and "C->A" are regarded as the same). The frequency of
appearance of such a dependency relation is used as the TFij or DFj for the relation between
keywords in this case.
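
The two labelling options of paragraph [0020] can be sketched as follows in Python. The function name `relation_label` and the list of (dependent, head) pairs are hypothetical; a real system would obtain the pairs from a syntactic analyzer, which is outside the scope of this sketch.

```python
from collections import Counter

def relation_label(dependent, head, directed=True):
    """Label a dependency between two keywords, ignoring grammatical case."""
    if directed:
        return f"{dependent}->{head}"
    # Direction ignored: "A->C" and "C->A" map to the same label "A&C".
    first, second = sorted((dependent, head))
    return f"{first}&{second}"

# Hypothetical analyzer output: (dependent, head) pairs found in one document.
dependencies = [("A", "C"), ("B", "C"), ("C", "A")]

directed_counts = Counter(relation_label(d, h) for d, h in dependencies)
undirected_counts = Counter(
    relation_label(d, h, directed=False) for d, h in dependencies
)
```

With direction kept, "A->C" and "C->A" occupy separate dimensions; with direction ignored, both are counted under "A&C".
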

[0021]On the other hand, when the closeness of the distance between keywords is used as the
relation between keywords, the distance can be measured, for example, by the number of
characters, morphemes, clauses, sentences, or paragraphs between the keywords. In this case too,
the direction may or may not be taken into account. As the TFij or DFj for the relation between
keywords in this case, for example, the frequency of appearance of keyword pairs whose distance
is smaller than a user-specified value is used.
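
A minimal Python sketch of the distance-based relation of paragraph [0021] follows. It assumes that the text has already been split into tokens (for example morphemes), that distance is the difference between token positions, that direction is ignored, and that pairs of identical keywords are skipped; the name `proximity_relations` is hypothetical.

```python
from collections import Counter

def proximity_relations(tokens, keywords, max_distance):
    """Count keyword pairs whose token distance is smaller than max_distance."""
    positions = [(i, t) for i, t in enumerate(tokens) if t in keywords]
    counts = Counter()
    for a in range(len(positions)):
        for b in range(a + 1, len(positions)):
            (i, x), (j, y) = positions[a], positions[b]
            if j - i >= max_distance:
                break  # positions are in order, so later keywords are farther
            if x != y:
                first, second = sorted((x, y))
                counts[f"{first}&{second}"] += 1
    return counts

tokens = ["A", "x", "B", "x", "x", "x", "C", "A"]
counts = proximity_relations(tokens, {"A", "B", "C"}, max_distance=3)
```

In this example "A" and "B" are two tokens apart and "C" and "A" are one token apart, so both pairs are counted, while "B" and "C" are four tokens apart and are not.
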

[0022]It is also possible to perform classification reflecting categories of keywords, such as
synonym relations. Taking the dependency relation as an example, suppose the keywords a0 and a1
belong to category A and each of a0 and a1 has a dependency relation with b. Then the dimensions
for the keywords a0 and a1, namely "a0", "a0->b", "a1", and "a1->b", can be merged into "A" and
"A->b" to generate the feature vector.
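
The merging of dimensions described in paragraph [0022] can be sketched in Python as below. The function name `merge_categories`, the mapping `category_of` and the sample counts are hypothetical, and only directed labels of the form "x->y" are handled in this sketch.

```python
from collections import Counter

def merge_categories(counts, category_of):
    """Sum the counts of features whose keywords belong to the same category."""
    merged = Counter()
    for label, n in counts.items():
        parts = label.split("->")
        # Replace each keyword by its category; uncategorized keywords are kept.
        merged["->".join(category_of.get(p, p) for p in parts)] += n
    return merged

counts = Counter({"a0": 2, "a0->b": 1, "a1": 3, "a1->b": 4, "b": 5})
merged = merge_categories(counts, {"a0": "A", "a1": "A"})
```

Here the counts for "a0" and "a1" are added together under "A", and those for "a0->b" and "a1->b" under "A->b", so the feature vector has three dimensions instead of five.
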

[0023]If the components of a given dimension are not both non-zero (that is, do not coincide) in
the feature vectors being compared during classification, that dimension does not contribute to
the similarity. In general, however, coincidence of a relation between keywords is less probable
than coincidence of a single keyword, so relations between keywords tend to contribute less to
the similarity than keywords do. The contributions of the two to the similarity evaluation can
therefore be balanced by making the weight of the dimensions for relations between keywords
larger than that of the dimensions for keywords.
[0024]It is also possible for the user to select a keyword and increase the weight of the
dimension of that keyword and of the dimensions of the relations between keywords which contain
it, thereby performing classification which emphasizes the keyword of interest to the user.
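
The two kinds of weighting in paragraphs [0023] and [0024] can be sketched as follows. The function name `apply_weights` and its parameters are hypothetical, and a dimension is treated as a relation when its label contains "->"; the weight values themselves are arbitrary examples.

```python
def apply_weights(features, vector, relation_weight=1.0,
                  focus_keyword=None, focus_weight=1.0):
    """Scale relation dimensions and dimensions involving a chosen keyword."""
    weighted = []
    for label, value in zip(features, vector):
        parts = label.split("->")
        # Relation dimensions get a larger weight than plain keyword dimensions.
        weight = relation_weight if len(parts) > 1 else 1.0
        # Dimensions containing the user-selected keyword are emphasized further.
        if focus_keyword is not None and focus_keyword in parts:
            weight *= focus_weight
        weighted.append(value * weight)
    return weighted

features = ["A", "A->C", "B"]
weighted = apply_weights(features, [1.0, 1.0, 1.0],
                         relation_weight=2.0, focus_keyword="A", focus_weight=3.0)
```

The weighted vectors would then be passed to the similarity calculation in place of the unweighted ones.
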

[0025]Embodiment 2. Fig. 2 is a block diagram showing an example configuration of the
information retrieval device for search, which is another embodiment of this invention. The
document database 10 and the output means 40 have the same functions as in Fig. 1.
[0026]The search formula input means 50 has the function of inputting a search condition as a
search formula (text expressing the search formula is also acceptable). The vector generating
means 20 has the function of generating, for all documents stored in the document database 10,
a feature vector from the frequencies of occurrence of keywords and of the relations between
keywords, for example according to the feature vector formula described in Embodiment 1, and
also the function of generating a feature vector from the input search formula by the same
method as the feature vector generation method described in Embodiment 1. When a feature vector
is generated from a search formula, however, the index i of Vij denotes the search formula
rather than document i. In this case, TFij is usually 1.

[0027]The search means 60 calculates the similarity between the feature vector generated from
the search formula and the feature vector of every document stored in the document database 10,
by the same method as that described for the classifying means of Embodiment 1, and uses the
result to rank the documents in the document database by similarity.

[0028]The ranked result can be output by the output means 40.
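
The ranking performed by the search means 60 can be illustrated with a short Python sketch. It assumes that the feature vectors of the search formula and of the documents have already been generated over the same dimensions; the names `cosine` and `rank_documents` and the sample vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_documents(query_vector, doc_vectors):
    """Return document indices ordered from most to least similar to the query."""
    scored = [(cosine(query_vector, v), i) for i, v in enumerate(doc_vectors)]
    # Sort by descending similarity; ties are broken by document index.
    return [i for _, i in sorted(scored, key=lambda s: (-s[0], s[1]))]

query = [1.0, 1.0, 0.0]          # dimensions: "A", "A->C", "B"
documents = [
    [0.0, 0.0, 1.0],             # shares nothing with the query
    [1.0, 1.0, 0.0],             # shares keyword "A" and relation "A->C"
    [1.0, 0.0, 0.0],             # shares only keyword "A"
]
ranking = rank_documents(query, documents)
```

The document that matches both the keyword and the relation is ranked first, ahead of the document that matches the keyword alone.
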

[0029]As in Embodiment 1, the vector generating means 20 can use, as examples of the relations
between keywords extracted when generating a feature vector for each document data, the
dependency relation between keywords, the closeness of the distance between keywords, and the
like.
[0030]It is also possible to perform search reflecting categories of keywords, such as synonym
relations. The way relations between keywords are merged by category is the same as in the
example of Embodiment 1.

[0031]As in the example of Embodiment 1, by increasing the weight of the dimensions of specific
relations between keywords, it is also possible to balance the weights of the dimensions for
relations between keywords, or to perform search which emphasizes the relations between
keywords of interest to the user.
[0032] 

[Effect of the Invention]The information retrieval device according to this invention has a
document database which stores a plurality of document data, a vector generating means which
generates a feature vector for each document data, a classifying means which calculates the
similarity between feature vectors and classifies the document data, and an output means which
outputs the classification result of the document data, and the vector generating means
analyzes each document data, extracts keywords and the relations between keywords, and
generates the feature vector based on the frequencies of occurrence of both. Therefore, in
classifying document data, similarity calculation reflecting not only the keywords of each
document data but also the relations between keywords becomes possible, and the accuracy of
classification improves.

[0033]The information retrieval device according to this invention also has a document database
which stores a plurality of document data, a search formula input means which inputs a search
formula, a vector generating means which generates a feature vector for each document data and
for the search formula, a similarity calculation means which calculates the similarity between
the feature vector of the search formula and the feature vector of each document data, and an
output means which outputs the document data whose feature vectors have high similarity, and
the vector generating means analyzes each document data and the search formula, extracts
keywords and the relations between keywords, and generates the feature vectors based on their
frequencies of occurrence. Therefore, in searching for document data close to a search formula,
similarity calculation reflecting not only the keywords which appear in the search formula and
in each document data but also the relations between keywords becomes possible, and the
accuracy of search improves.

[0034]The vector generating means uses a dependency relation as the relation between keywords.
Therefore, in classifying document data or searching for document data close to a search
formula, similarity calculation reflecting the dependency relation is performed, and the
accuracy of classification or search improves compared with the conventional method using
keywords alone.

[0035]The vector generating means uses, as the relation between keywords, the fact that the
distance between the keywords is short. Therefore, in classifying document data or searching
for document data close to a search formula, similarity calculation reflecting the distance
between keywords is performed, and the accuracy of classification or search improves compared
with the conventional method using keywords alone.

[0036]In place of the individual frequencies of occurrence of the keywords contained in a
keyword group belonging to the same category, or of the relations between keywords containing
such keywords, the vector generating means uses the sum of those frequencies as the frequency
of occurrence of a keyword representing the category, or of the relation between keywords
containing that representative keyword. Therefore, in classifying document data or searching
for document data close to a search formula, similarity calculation can be performed with the
keywords and relations between keywords which the user does not need to distinguish merged
together. As a result, classification and search can be made both more precise and more
efficient.

[0037]The vector generating means generates the feature vector based on weights which the user
specifies for the frequencies of occurrence of relations between keywords. Therefore, in
classifying document data or searching for document data close to a search formula, similarity
calculation which reflects the user's intention by emphasizing or de-emphasizing specific
relations between keywords can be performed. As a result, classification and search can be made
more precise in a form which better reflects the user's intention.

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MEANS 



[Means for Solving the Problem]A document data base with which an information retrieval device 
concerning this invention stores two or more document data, In an information retrieval device 
which has a vector generating means which generates a feature vector to each document data, a 
sorting means which calculates similarity between feature vectors and classifies each document 
data, and an output means which outputs a classification result of document data, A vector 
generating means analyzes each document data respectively, extracts a relation between 
keywords, and generates a feature vector based on both frequency of occurrence of these. 
[0007]A document data base with which an information retrieval device concerning this invention 
stores two or more document data, A search formula input means which inputs a search formula, 
and a vector generating means which generates a feature vector to each document data and 
search formula, In an information retrieval device which has a similarity calculation means to 
calculate similarity between a feature vector to a search formula, and a feature vector to each 
document data, and an output means which outputs document data which has a high feature 
vector of similarity, A vector generating means analyzes each document data and a search 
formula respectively, extracts a relation between keywords, and generates a feature vector 
based on these frequencies of occurrence. 

[0008]A relation of dependency is used for a vector generating means as a relation between 
keywords. 

[0009]It is used for a vector generating means as a relation between keywords that distance 
between keywords is near. 

[0010]Instead of the frequency of occurrence of a relation between keywords in which a vector 
generating means contains a keyword contained in a keyword group belonging to the same 
category, or it. What added those frequencies of occurrence, respectively is used as the 
frequency of occurrence of a relation between keywords containing a keyword or it representing 
the category. 

[001 1]A vector generating means generates a feature vector based on weighting which a user 

specifies to the frequency of occurrence of a relation between keywords. 

[0012] 

[Embodiment of the Invention]Embodiment 1 . drawing 1 is a block diagram showing the example 
of composition of the information retrieval device about the classification of this invention. In a 
figure, the document data base 10 stores two or more document data. Each document data 
stored in the document data base 10 has text data at least 

[0013]The vector generating means 20 generates a feature vector to each document data. That 
is, conduct a morphological analysis etc. to the text data of each document data, perform 
unnecessary word processing etc. if needed, and a keyword is extracted, and the relation 
between keywords is extracted. 

[0014]Next, when R relations between K keywords are extracted from the document data base 
10 whole which consists of N documents, the feature vector Vi of each document i (1 <=K=N) is 
expressed with the vector of a K+R dimension, for example. When the index of the relation 
between keywords is expressed with j (1 <=j<=K+R), according to the tf-idf method, the 
ingredient Vy of each dimension j of the feature vector Vi can be computed by the following 



formula, for example. 
[0015]Vu=TFu*log(N/DFj) 

[0016]Here, TFij is [ be / it / under / document i / setting ] the number of times in which the 
relation between the keywords corresponding to j ingredient appears, and DFj is the number of 
times in which the relation between the keywords corresponding to j ingredient appears in N 
whole sentence in the letter of the document data base 10. Thus, a feature vector is generated. 
[0017]In the sorting means 30, the similarity between documents is calculated and a document is 
clustered using the result The similarity between documents is calculable with the cosine value 
of the angle between each feature vector of two or more sentence document computed as 
mentioned above, for example. By using the similarity calculated about clustering using the 
feature vector generated as mentioned above to similarity calculation required for clustering 
algorithms, such as K method of averaging. Clustering not only reflecting the keyword used by 
the conventional clustering but the relation between keywords is attained. 
[0018]The classified result can be outputted by the output means 40. 

[0019]Here, what have the relation of the dependency obtained as a result of syntax analysis and 
the distance between keywords near as an example of the related extraction between the 
keywords at the time of generating a feature vector to each document data can be considered 
by the vector generating means 20. 

[0020]First, the sentence "C A carries out B" is considered about the relation of dependency, 
for example. In this sentence, the relation of the dependency of "C A carrying out" and "C 
Carrying out B" exists. Although it may identify including these to a rank, a rank being 
disregarded, and "A->C", "B->C", or a direction also being disregarded, and considering it as 
"A&C and "B&C" (it considers that "A->C" and "C->A" are the same) is also considered. The 
appearance frequency of such dependency will be used as said TFij concerning the relation 
between the keywords in this case, or DFj. 

[0021]As an example of the distance in the case of on the other hand using what has a near 
distance between keywords as a relation between keywords, the number of characters between 
keywords, the number of morphemes, the number of clauses, the number of sentences, the 
number of paragraphs, etc. can be considered, for example. The case where a direction is 
considered also in this case may not be considered. As said TF(j concerning the relation between 
the keywords in this case, or DFj, appearance frequency when this distance is smaller than a 
user designated value will be used, for example. 

[0022]It is also possible to perform classification that reflects keyword categories, such as 
synonym relations. Taking dependency relations as an example, suppose the keywords a0 and a1 
belong to the category A, and a0 and a1 each have a dependency relation with b. Then the 
dimensions concerning a0 and a1, namely "a0", "a0->b", "a1", and "a1->b", can be merged into 
"A" and "A->b" to generate the feature vector. 
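
The merging of dimensions by category can be illustrated as follows. This sketch assumes frequencies are held in a dictionary keyed by dimension name and that merged frequencies are simply summed; the function name `merge_by_category` and the mapping `category_of` are hypothetical.

```python
from collections import Counter

def merge_by_category(counts, category_of):
    """Replace each keyword by its category and add up the frequencies."""
    merged = Counter()
    for key, freq in counts.items():
        if "->" in key:
            # Relation dimension: map both ends, keeping unmapped keywords as-is.
            left, right = key.split("->")
            new_key = f"{category_of.get(left, left)}->{category_of.get(right, right)}"
        else:
            new_key = category_of.get(key, key)
        merged[new_key] += freq
    return merged

counts = {"a0": 2, "a0->b": 1, "a1": 3, "a1->b": 1, "b": 2}
category_of = {"a0": "A", "a1": "A"}  # a0 and a1 belong to category A
print(merge_by_category(counts, category_of))
```

After merging, documents that use a0 and documents that use a1 share the dimensions "A" and "A->b", so they can match each other.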

[0023]A dimension of the feature vectors being compared in classification does not contribute 
to the similarity unless both vectors have a non-zero component in that dimension (a match). In 
general, however, a match in a relation between keywords is less probable than a match in a 
single keyword, so the contribution of relations between keywords to the similarity tends to be 
lower than that of keywords. The contributions of the two to the similarity evaluation can 
therefore be balanced by giving the dimensions for relations between keywords a larger weight 
than the dimensions for keywords. 
[0024]It is also possible for a user to select a keyword and to increase the weight of the 
dimension of that keyword and of the dimensions of the relations between keywords that contain 
it, thereby performing classification that emphasizes the keyword the user is interested in. 
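
Both kinds of weighting can be sketched in one small function. The parameter names `relation_weight`, `focus_keyword`, and `focus_weight`, and the rule of multiplying the two weights together, are assumptions for illustration; the text does not specify how the weights are combined.

```python
def apply_weights(vector, relation_weight=1.0, focus_keyword=None, focus_weight=1.0):
    """Scale relation dimensions, and dimensions containing a user-selected keyword."""
    weighted = {}
    for key, value in vector.items():
        parts = key.split("->")
        # Relation dimensions ("A->C") get relation_weight, keyword dimensions get 1.0.
        weight = relation_weight if len(parts) > 1 else 1.0
        # Any dimension that contains the selected keyword is emphasized further.
        if focus_keyword is not None and focus_keyword in parts:
            weight *= focus_weight
        weighted[key] = value * weight
    return weighted

vector = {"A": 1.0, "B": 1.0, "A->C": 1.0}
print(apply_weights(vector, relation_weight=2.0))
print(apply_weights(vector, relation_weight=2.0, focus_keyword="A", focus_weight=3.0))
```

The weighted vectors would then be passed to the similarity calculation in place of the unweighted ones.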

[0025]Embodiment 2. Drawing 2 is a block diagram showing an example configuration of an 
information retrieval device for search, which is another example of this invention. The 
document data base 10 and the output means 40 have the same functions as in Drawing 1. 
[0026]The search formula input means 50 has a function of inputting a search condition as a 
search formula (text expressing the search condition may also be used). The vector generating 
means 20 has a function of generating, for all documents stored in the document data base 10, a 
feature vector from the frequencies of occurrence of the relations between keywords according 
to, for example, the feature vector formula described in Embodiment 1. It also has a function 
of generating a feature vector from the input search formula by the same method as the feature 
vector generation method described in Embodiment 1. However, when generating a feature vector 
from a search formula, the index i of Vij denotes the search formula rather than the document 
i. In this case, TFij is usually set to 1. 

[0027]The search means 60 calculates the similarity between the feature vector generated from 
the search formula and the feature vector of each document stored in the document data base 10, 
in the same way as the method described for the sorting means of Embodiment 1, and ranks the 
documents in the document data base by similarity using the result. 
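
The search flow can be sketched as follows, under simplifying assumptions: each term of the search formula gets frequency 1, any further weighting factor of the feature vector formula is omitted, and similarity is the cosine between sparse dictionary vectors. The names `rank_documents`, `query_terms`, and `doc_vectors` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_documents(query_terms, doc_vectors):
    """Return (document id, similarity) pairs, most similar first."""
    # Each keyword or relation in the search formula gets frequency 1.
    query_vector = {term: 1.0 for term in query_terms}
    scored = [(cosine_similarity(query_vector, vec), doc_id)
              for doc_id, vec in doc_vectors.items()]
    # Sort by descending similarity; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored]

doc_vectors = {
    "d1": {"A": 1.0, "C": 1.0, "A->C": 1.0},
    "d2": {"A": 1.0, "C": 1.0, "C->A": 1.0},
    "d3": {"B": 1.0},
}
for doc_id, score in rank_documents(["A", "C", "A->C"], doc_vectors):
    print(doc_id, round(score, 3))
```

Here d1 and d2 contain the same keywords, but only d1 contains the relation named in the query, so it ranks first.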

[0028]The ranked result can be output by the output means 40. 

[0029]As described in Embodiment 1, the vector generating means 20 can use dependency relations 
between keywords, proximity between keywords, and the like as examples of the relations between 
keywords extracted when generating a feature vector for each document data. 
[0030]It is also possible to perform search that reflects keyword categories, such as synonym 
relations. The way relations between keywords are merged by category is the same as 
illustrated in Embodiment 1. 

[0031]By increasing the weight of the dimensions of specific relations between keywords, it is 
also possible, as illustrated in Embodiment 1, to balance the weights of the dimensions, or to 
perform search that emphasizes the relations between keywords that the user is interested in. 

EFFECT OF THE INVENTION 



[Effect of the Invention]The information retrieval device according to this invention has a 
document data base that stores two or more document data, a vector generating means that 
generates a feature vector for each document data, a sorting means that calculates the 
similarity between feature vectors and classifies each document data, and an output means that 
outputs the classification result of the document data. The vector generating means analyzes 
each document data, extracts keywords and the relations between keywords, and generates a 
feature vector based on the frequencies of occurrence of both. Therefore, in classifying 
document data, similarity can be calculated in a way that reflects not only the keywords of 
each document data but also the relations between keywords, and accuracy improves. 

[0033]The information retrieval device according to this invention has a document data base 
that stores two or more document data, a search formula input means that inputs a search 
formula, a vector generating means that generates a feature vector for each document data and 
for the search formula, a similarity calculation means that calculates the similarity between 
the feature vector of the search formula and the feature vector of each document data, and an 
output means that outputs the document data whose feature vectors have high similarity. The 
vector generating means analyzes each document data and the search formula, extracts keywords 
and the relations between keywords, and generates a feature vector based on their frequencies 
of occurrence. Therefore, in searching for document data close to a search formula, similarity 
can be calculated in a way that reflects not only the keywords appearing in the search formula 
and in each document data but also the relations between keywords, and the accuracy of search 
improves. 

[0034]The vector generating means uses dependency relations as the relation between keywords. 
Therefore, in classifying document data or searching for document data close to a search 
formula, the similarity calculation reflects dependency relations, and the accuracy of 
classification or search improves compared with the conventional method that uses only 
keywords. 

[0035]The vector generating means uses proximity between keywords (a short distance between 
them) as the relation between keywords. Therefore, in classifying document data or searching 
for document data close to a search formula, the similarity calculation reflects the distance 
between keywords, and the accuracy of classification or search improves compared with the 
conventional method that uses only keywords. 

[0036]Instead of the individual frequencies of occurrence of the keywords in a keyword group 
belonging to the same category, or of the relations between keywords that contain those 
keywords, the vector generating means uses the sum of those frequencies as the frequency of 
occurrence of the keyword representing the category, or of the relation between keywords that 
contains it. Therefore, in classifying document data or searching for document data close to a 
search formula, the similarity calculation can merge relations between keywords that the user 
does not need to distinguish. As a result, highly precise classification and search can be 
performed more efficiently. 

PRIOR ART 



[Description of the Prior Art]For classification and search of electronic document data, an 
information retrieval device such as that shown in JP,11-110395,A has conventionally been 
proposed. In the information retrieval device proposed there, the frequencies of occurrence of 
synonyms are merged, a feature vector is generated, the similarity between feature vectors is 
calculated, and each document data is classified. JP,11-110395,A also proposes assigning a 
weight to each of two or more words that have a synonym relation. 
[0003]JP,10-198691,A discloses merging the frequencies of occurrence of synonyms to generate a 
feature vector, and also discloses, for example, registering adjacent word pairs and synonym 
pairs in a document data base and using them to calculate a feature vector. 

DESCRIPTION OF DRAWINGS 



[Brief Description of the Drawings] 

[Drawing 1] A block diagram showing the information retrieval device of this invention for 
classification. 

[Drawing 2] A block diagram showing the information retrieval device of this invention for 
search. 
[Description of Notations] 

10 document data base, 20 vector generating means, 30 sorting means, 40 output means, 50 search 
formula input means, 60 similarity calculation means (search means). 

F-term (reference) 5B075 ND03 NK02 NR12 PP23 PR04 PR06 Q1I08 



